1 Introduction

Peer-to-peer (P2P) lending was a phenomenon less than ten years ago, exploding in popularity by offering a break from traditional banking. Individuals flocked to these alternative credit markets to finance small business ventures and home purchases and to consolidate debt. Although direct P2P lending has changed in recent years, it remains a viable option for borrowers and investors.

The global P2P lending market is anticipated to grow from $84 billion in 2021 to $706 billion by 2030, according to figures from Precedence Research.1 This analysis takes a closer look at the mechanics of P2P lending to better understand the considerations behind decisions to apply for, issue, and provide financing via P2P platforms. Our analysis and statistical testing identify credit underwriting policy and borrowers’ failure to fully pay as variables of interest that should be considered in future in-depth analyses.

1.1 What is P2P Lending?

P2P lending is the provision of financing without a traditional bank as the source of funds; it is, as it sounds, peers lending money to their peers. Instead of banks, online lending platforms provide a service that connects willing lenders, or investors, with individuals seeking to borrow funds. Historically, these investors have been predominantly private individuals seeking alternative forms of investment, in which they receive the interest earned on the money they loan out.

Borrowers, on the other hand, are connected to funding that they might not otherwise have been able to obtain. Many borrowers participating in P2P lending experienced, or would have experienced, difficulty qualifying for traditional loans from banks. Lenders' perception of higher risk can often translate into higher interest rates. P2P platforms screen borrowers and set rates and terms, but it is ultimately up to the lender whether to provide the funds.

1.2 Introducing LendingClub

The P2P market was dominated by LendingClub during the early rise of P2P lending, and it remains a leader in the industry. It makes money by charging borrowers an origination fee, charging investors a service fee, and selling loans in secondary markets. LendingClub’s typical annual percentage rate (APR) is between 5.99% and 35.89% while the origination fee of 1% to 6% is taken off the top of loans. Borrowers on LendingClub typically have good-to-excellent credit (700 or higher credit score) and a low debt-to-income ratio.

2 Exploratory Data Analysis (EDA)

Our exploratory data analysis closely follows the nine-step checklist below, presented in Chapter 4 of The Art of Data Science.2

  1. Formulate our question
  2. Read in our data
  3. Check the packaging
  4. Look at the top and the bottom of your data
  5. Check your “n”s
  6. Validate with at least one external data source
  7. Make a plot
  8. Try the easy solution first
  9. Follow up

2.1 Our Data

Our dataset contains over 9,500 observations of LendingClub loan data from 2007 to 2015. We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis

Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101

Below are the variables in the dataset and their accompanying definitions as supplied by Kaggle:

Variable Definition
credit.policy 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose The purpose of the loan (takes values credit_card, debt_consolidation, educational, major_purchase, small_business, and all_other).
int.rate The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment The monthly installments owed by the borrower if the loan is funded.
log.annual.inc The natural log of the self-reported annual income of the borrower.
dti The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico The FICO credit score of the borrower.
days.with.cr.line The number of days the borrower has had a credit line.
revol.bal The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
not.fully.paid 1 if the borrower did not fully repay the loan, and 0 otherwise.

2.2 SMART Question

Our analysis will explore debt-to-income ratios, credit scores, interest rates, and delinquencies among direct P2P borrowers in an attempt to understand the risks and opportunities associated with P2P lending. Specifically, we intend to examine the impact that these variables have on who received loans and who defaulted on their loans between 2007 and 2015.

We will seek to answer the following questions:

  1. What variable or variables, if any, have an impact on whether the person meets the credit underwriting criteria? How strong is that impact?
  2. What variable or variables, if any, have an impact on whether the person fully repays the loan? How strong is that impact?
  3. Do borrowers who meet the credit underwriting criteria have a lower chance of not fully repaying the loan? If so, how big is the difference, and is it statistically significant?

2.3 First Look

For our analysis we will use the ezids, tidyverse, corrplot, scales, gridExtra, expss, knitr, kableExtra, broom, and purrr libraries. Our dataset contains 9578 rows of data with 14 columns and is structured like this:

Name Class Length Frequency
credit.policy numeric 9578 1=7710, 0=1868
purpose character 9578 debt_consolidation=3957, all_other=2331, credit_card=1262, home_improvement=629, small_business=619, major_purchase=437, educational=343
int.rate numeric 9578 0.1253=354, 0.0894=299, 0.1183=243, 0.1218=215, 0.0963=210, 0.1114=206, 0.08=198, 0.1287=197, 0.1148=193, 0.0859=187, Other values=7276
installment numeric 9578 317.72=41, 316.11=34, 319.47=29, 381.26=27, 662.68=27, 156.1=24, 320.95=24, 188.02=23, 334.67=23, 669.33=23, Other values=9303
log.annual.inc numeric 9578 11.00209984=308, 10.81977828=248, 10.30895266=224, 10.59663473=224, 10.71441777=221, 11.22524339=196, 11.15625052=165, 10.77895629=149, 10.91508846=147, 11.08214255=146, Other values=7550
dti numeric 9578 0=89, 10=19, 0.6=16, 6=13, 12=13, 13.16=13, 15.1=13, 19.2=13, 8.21=12, 10.8=12, Other values=9365
fico numeric 9578 687=548, 682=536, 692=498, 697=476, 702=472, 707=444, 667=438, 677=427, 717=424, 662=414, Other values=4901
days.with.cr.line numeric 9578 3660=50, 3630=48, 3990=46, 4410=44, 3600=41, 2550=38, 4080=38, 1800=37, 3690=37, 4020=35, Other values=9164
revol.bal numeric 9578 0=321, 255=10, 298=10, 682=9, 346=8, 182=6, 1085=6, 2229=6, 1=5, 6=5, Other values=9192
revol.util numeric 9578 0=297, 0.5=26, 0.3=22, 47.8=22, 73.7=22, 0.1=21, 3.3=21, 0.2=20, 0.7=20, 1=20, Other values=9087
inq.last.6mths numeric 9578 0=3637, 1=2462, 2=1384, 3=864, 4=475, 5=278, 6=165, 7=100, 8=72, 9=47, Other values=94
delinq.2yrs numeric 9578 0=8458, 1=832, 2=192, 3=65, 4=19, 5=6, 6=2, 7=1, 8=1, 11=1, Other values=1
pub.rec numeric 9578 0=9019, 1=533, 2=19, 3=5, 4=1, 5=1
not.fully.paid numeric 9578 0=8045, 1=1533

By examining the structure of our data, we can see that there is only one character variable, purpose, which looks like a factor, and that some of the numeric variables (credit.policy and not.fully.paid) look like logicals.

Here we can see the top and bottom rows of our dataset to get a better feel for the data. This will help us better understand the values in our dataset and how to most effectively deal with them.

Head
credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
1 debt_consolidation 0.1189 829.10 11.3504 19.48 737 5639.958 28854 52.1 0 0 0 0
1 credit_card 0.1071 228.22 11.0821 14.29 707 2760.000 33623 76.7 0 0 0 0
1 debt_consolidation 0.1357 366.86 10.3735 11.63 682 4710.000 3511 25.6 1 0 0 0
1 debt_consolidation 0.1008 162.34 11.3504 8.10 712 2699.958 33667 73.2 1 0 0 0
1 credit_card 0.1426 102.92 11.2997 14.97 667 4066.000 4740 39.5 0 1 0 0
Tail
credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
0 all_other 0.1461 344.76 12.1808 10.39 672 10474.000 215372 82.1 2 0 0 1
0 all_other 0.1253 257.70 11.1419 0.21 722 4380.000 184 1.1 5 0 0 1
0 debt_consolidation 0.1071 97.81 10.5966 13.09 687 3450.042 10036 82.9 8 0 0 1
0 home_improvement 0.1600 351.58 10.8198 19.18 692 1800.000 0 3.2 5 0 0 1
0 debt_consolidation 0.1392 853.43 11.2645 16.28 732 4740.000 37879 57.0 6 0 0 1

The top and bottom rows of our dataset indicate the data is structured in an acceptable way and that our variables match up with the values for each column.

According to the Kaggle site where we obtained this dataset, there are 9,578 rows and 14 columns, which matches what we have. The site also shows that there is no missing data. We can verify that by counting the missing cells in the dataset, which gives 0, and the null cells, which also gives 0. We can also check whether the observations are unique, and we see that all 9578 rows are unique.
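
The packaging checks described above (dimensions, missing cells, duplicate rows) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used for the report. The inline CSV is just the first three rows of the Head table with a subset of columns, the function name `packaging_report` is hypothetical, and treating empty strings and "NA" as missing is an assumption.

```python
import csv
import io

# Illustrative excerpt: first three rows of the Head table, seven columns only.
SAMPLE_CSV = """credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico
1,debt_consolidation,0.1189,829.10,11.3504,19.48,737
1,credit_card,0.1071,228.22,11.0821,14.29,707
1,debt_consolidation,0.1357,366.86,10.3735,11.63,682
"""


def packaging_report(csv_text):
    """Return row, column, missing-cell, and duplicate-row counts for CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [tuple(row) for row in reader]
    # Assumption: an empty string or the literal "NA" marks a missing cell.
    missing = sum(cell.strip() in ("", "NA") for row in rows for cell in row)
    # Rows are hashable tuples, so a set removes exact duplicates.
    duplicates = len(rows) - len(set(rows))
    return {
        "rows": len(rows),
        "columns": len(header),
        "missing": missing,
        "duplicates": duplicates,
    }


report = packaging_report(SAMPLE_CSV)
print(report)
```

Run against the full file, the same checks would be expected to report 9578 rows, 14 columns, and zero missing or duplicated rows, in line with the figures above.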

This means the data looks good so far and we can now move on to the descriptive statistics part of our EDA.

2.4 Descriptive Statistics

Below are some descriptive statistics of the variables to help us better understand the data.
Table: Statistics summary.
Statistic credit.policy purpose int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid
Min 0.000 NA 0.0600 15.67 7.548 0.000 612.0 179 0 0.0 0.000 0.0000 0.00000 0.0000
Q1 1.000 NA 0.1039 163.77 10.558 7.213 682.0 2820 3187 22.6 0.000 0.0000 0.00000 0.0000
Median 1.000 NA 0.1221 268.95 10.929 12.665 707.0 4140 8596 46.3 1.000 0.0000 0.00000 0.0000
Mean 0.805 NA 0.1226 319.09 10.932 12.607 710.8 4561 16914 46.8 1.577 0.1637 0.06212 0.1601
Q3 1.000 NA 0.1407 432.76 11.291 17.950 737.0 5730 18250 70.9 2.000 0.0000 0.00000 0.0000
Max 1.000 NA 0.2164 940.14 14.528 29.960 827.0 17640 1207359 119.0 33.000 13.0000 5.00000 1.0000
Note: purpose is a character variable (length 9578), so no numeric summary applies to it.

We have an idea of what to expect for a few variables, such as interest rate and credit score, so we were able to test the dataset against some of our expectations to gauge its reliability. By inspecting the summary table, we can see that interest rates in the data range from 6% to 21.64% and credit scores range from 612 to 827. Although the highest interest rates might seem excessive and the lowest credit scores meager, the P2P market tended to consist of riskier loans. This aligns with our expectations and reinforces our confidence in the dataset.

The range of the utilization, or the percent of credit being used, is between 0% and 119%. Someone utilizing more than 100% of the credit available to them initially seemed erroneous; however, this can occur from technical error, creditors and collectors reporting at different date/times, borrowers opening and closing credit lines, or possibly when borrowers appear as authorized users of others’ credit lines. Regardless of the reason, only 27 loans within our dataset appear to exceed the standard maximum of 100% so we do not expect this to have a significant effect on our analysis, thereby allowing us to continue with our EDA.

We want to see a measure of dispersion/variation, namely standard deviation, for the numeric variables. The values are as follows:

Variable Standard Deviation
credit.policy 0.40
int.rate 0.03
installment 207.07
log.annual.inc 0.61
dti 6.88
fico 37.97
days.with.cr.line 2496.93
revol.bal 33756.19
revol.util 29.01
inq.last.6mths 2.20
delinq.2yrs 0.55
pub.rec 0.26
not.fully.paid 0.37

Here, we have used standard deviation as a measure of dispersion to understand the spread of each variable. Variables like credit.policy and not.fully.paid can only take one of two values, 0 and 1, so their standard deviations are small, while revol.bal, days.with.cr.line, and installment have the highest standard deviations, in that order.
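
As a quick consistency check, the standard deviations of the two 0/1 variables can be recomputed from the frequency counts reported in the First Look table (credit.policy: 7710 ones and 1868 zeros; not.fully.paid: 1533 ones and 8045 zeros). The sketch below uses Python's standard library rather than the R used for the report, and the helper name `binary_sample_sd` is hypothetical.

```python
import statistics


def binary_sample_sd(ones, zeros):
    """Sample standard deviation (n - 1 denominator) of a 0/1 variable given its counts."""
    values = [1] * ones + [0] * zeros
    return statistics.stdev(values)


# Counts taken from the Frequency column of the First Look table.
sd_credit_policy = binary_sample_sd(ones=7710, zeros=1868)
sd_not_fully_paid = binary_sample_sd(ones=1533, zeros=8045)

print(round(sd_credit_policy, 2), round(sd_not_fully_paid, 2))
```

Rounded to two decimals, both values agree with the 0.40 and 0.37 shown in the table above.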

We are satisfied with the data so far, so the next step is to begin visualizing it.

2.5 Initial Plots

Below is a histogram for each non-logical numeric variable to help us understand how the data is distributed:

One of the primary things we are looking for is normality, which we can see in the log.annual.inc variable. Some other variables look at least somewhat normal, such as days.with.cr.line, fico, installment, and int.rate. We will make Q-Q plots later to get a better sense of the normality of these variables.

The revol.util variable is fairly flat, and dti is a little more rounded, but neither looks normal. There are four variables where we clearly have outlier issues: delinq.2yrs, inq.last.6mths, pub.rec, and revol.bal. For those we can see one or a small number of large bars on the left, and then a fairly flat graph near zero after that.

To help get a better look at the outliers, below are boxplots for the same variables:

With outlier.alpha set to 0.2 to compensate for overplotting, we can now get a better understanding of the four variables with outlier issues. For delinq.2yrs and pub.rec, nearly all of the observations are 0, with a small number of outliers at integer values above zero. It looks like these two variables will not be very useful to us.

The inq.last.6mths variable does have some range, with most observations equal to 0, 1, or 2. We can consider taking out the outliers later on, and hopefully this variable will prove useful.

The revol.bal variable actually has a good range, but the outliers are so far out that it is difficult to see. For this variable we’ll want to consider removing outliers and possibly transforming the data by taking its natural log, as was already done with annual income (log.annual.inc).

We can take a look at the factor and logical variables with bar charts:

From this we can see that the purpose variable would be a good candidate for ANOVA tests. Each value of these variables has at least a few hundred observations, so we should have a large enough sample size for further analysis and statistical tests. At this point we can comfortably convert credit.policy and not.fully.paid to logicals, and purpose to a factor.

Now we can further explore the data to see how the numeric variables differ based on the credit.policy, not.fully.paid, and purpose variables. Let’s make some boxplots to visualize this.

2.6 Additional Boxplots

Here we look at the numeric variables for individual sub-categories of the logical and categorical variables. This helps us comprehensively understand how the numerical variables are distributed with respect to each of the logical and categorical values.

Grouped by credit.policy, a few numeric variables look fairly different: fico, inq.last.6mths, and int.rate.

Grouped by not.fully.paid, no variables visually stand out to a significant degree.

From this we can see some of the purpose categories stand out for certain variables. The debt_consolidation and credit_card purposes stand out for dti and revol.util, while small_business stands out for installment and int.rate. We will perform one-way ANOVA tests later to confirm what we can see here.

2.7 Correlation Plot

The correlation plot below gives an overview of how each of the variables in the dataset may relate to each other. This plot includes every variable except the purpose variable.

This allows us to quantify some of the stand-outs in the additional boxplots such as int.rate, fico, and inq.last.6mths compared to credit.policy. There are also four other correlations that are either greater than 0.4 or less than -0.4 that we want to explore further.
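
For readers unfamiliar with what each cell of a correlation plot represents, the sketch below computes a Pearson correlation coefficient by hand. It assumes the plot uses Pearson correlations (the usual default) and is written in Python rather than the R used for the report. The ten FICO and interest-rate values are simply the rows shown in the Head and Tail tables, so the resulting number only illustrates the calculation and is not the coefficient from the full dataset; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# fico and int.rate for the ten rows in the Head and Tail tables (illustration only).
fico = [737, 707, 682, 712, 667, 672, 722, 687, 692, 732]
int_rate = [0.1189, 0.1071, 0.1357, 0.1008, 0.1426,
            0.1461, 0.1253, 0.1071, 0.1600, 0.1392]

r = pearson_r(fico, int_rate)
print(f"r = {r:.2f}")
# A pair would be flagged for follow-up when abs(r) exceeds 0.4, the cutoff used above.
```

Even on these ten rows the sign is negative, which is the direction one would expect between credit score and interest rate.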

2.8 Scatter Plots

These scatter plots give us a better understanding of the four correlations greater than 0.4 or less than -0.4 that we have not yet examined.

With the point alpha set to 0.2 to compensate for overplotting, we can see a clear trend of interest rates falling as the borrower’s FICO score goes up, which matches our expectations. All of the FICO scores end in 2 or 7, which is why the results fall into vertical lines.

Here we can see that as the borrower’s interest rate climbs, so does the revolving line utilization rate. This makes sense because we would expect a higher interest rate to be associated with a more risky loan. If the loan is riskier the borrower probably has more difficulty getting credit, and therefore would make use of a higher percentage of the credit they do have available.

As the natural log of the borrower’s annual income increases, we see that the installment of their loan does as well. This also matches our expectations, as those who make more money would likely be able to make higher payments.

We see that as the borrower’s FICO score goes up, their revolving line utilization rate decreases, related to the explanations from the above scatter plots. In the end, this also matches our expectations.

2.9 Last EDA Steps

Before moving into more advanced statistical tests, we want to take an initial look at how meeting the credit underwriting criteria relates to the borrower not fully paying. Using the credit.policy and not.fully.paid variables, we can calculate the percentage of borrowers who did not fully pay, split by whether they met the credit underwriting criteria.

Meets Credit Policy Percent Not Fully Paid
FALSE 27.8%
TRUE 13.2%

From this we can see that about 13.2% of borrowers who met the credit underwriting criteria did not fully pay, while for the borrowers who did not meet the credit underwriting criteria about 27.8% did not fully pay.
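
The arithmetic behind these percentages can be sketched as follows, in Python rather than the R used for the report. The cell counts (519 and 1014 borrowers who did not fully pay) are not stated in the report; they are back-derived here as the only whole numbers consistent with the reported group sizes (1868 and 7710), the 1533 not-fully-paid loans, and the rounded percentages, so they should be read as an inference.

```python
# Assumption: 519 and 1014 are inferred from the reported totals and rounded
# percentages; the report itself does not list these cell counts.
counts = {
    False: {"not_fully_paid": 519, "total": 1868},   # did not meet credit policy
    True: {"not_fully_paid": 1014, "total": 7710},   # met credit policy
}

# Share of each group that did not fully pay.
rates = {meets: c["not_fully_paid"] / c["total"] for meets, c in counts.items()}

# How many times more likely non-qualifying borrowers were to not fully pay.
relative_risk = rates[False] / rates[True]

for meets, rate in rates.items():
    print(f"Meets credit policy = {meets}: {rate:.1%} not fully paid")
print(f"Ratio of the two rates: {relative_risk:.2f}")
```

The ratio of the two rates comes out slightly above 2.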

This indicates that borrowers who did not meet the credit underwriting criteria were about twice as likely to default on their loans as those who did meet the criteria. For comparison, delinquency rates on loans from commercial banks for the same period as our dataset averaged 4.48%, with a maximum of 7.49% towards the end of 2009, according to the St. Louis Federal Reserve Bank.3

These loans were clearly riskier than commercial bank loans, especially when the borrower did not meet the credit underwriting criteria. Based on this, a potential lender would be wise to give serious consideration to whether the potential borrower meets the credit underwriting policy.

3 Statistical Tests

We produced Q-Q plots and conducted ANOVA and chi-squared tests to examine the variables in our dataset and gain a better understanding of how they interact with each other.

Since our dataset is a subset of all loans facilitated by LendingClub between 2007 and 2015, we can treat it as a sample and conduct t-tests, which confirm that the sample means of all the variables are consistent with the population means. However, we cannot perform z-tests on this data because the standard deviation of the population (all loans facilitated by LendingClub) is unknown to us and we cannot estimate it.4 5 6 7 8

3.1 Q-Q Plots for Normality Test

We want to create a Q-Q plot for each numeric variable so we can assess the normality of each. This reinforces what we saw in the histograms during our EDA.

From the plots we can see that int.rate, log.annual.inc, and fico are the three variables closest to a normal distribution.
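
For reference, the points of a normal Q-Q plot pair each sorted observation with the standard normal quantile at a matching probability; points that fall close to a straight line suggest approximate normality. The sketch below, in Python rather than the R used for the report, assumes plotting positions of (i + 0.5) / n, which is one common convention and may differ slightly from what a given plotting function uses. The ten log.annual.inc values are the rows from the Head and Tail tables, used only to illustrate the mechanics; `qq_points` is a hypothetical helper name.

```python
from statistics import NormalDist


def qq_points(sample):
    """Pair each sorted observation with its theoretical standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumption: plotting positions (i + 0.5) / n for i = 0, ..., n - 1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# log.annual.inc for the ten rows in the Head and Tail tables (illustration only).
log_annual_inc = [11.3504, 11.0821, 10.3735, 11.3504, 11.2997,
                  12.1808, 11.1419, 10.5966, 10.8198, 11.2645]

points = qq_points(log_annual_inc)
for theoretical_q, observed in points:
    print(f"{theoretical_q:+.3f}  {observed:.4f}")
```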

3.2 ANOVA Tests

ANOVA indicates that there is a significant difference in the means of all of our variables for the different categories of purpose, except for one. Subsequent Tukey tests confirm these differences and validate our ANOVA tests. The ANOVA test for the one exception — delinquencies in the past two years — indicates there is no significant difference in its mean for the different categories of purpose. Similarly, subsequent Tukey tests confirm this lack of a difference and validate the ANOVA test.
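
To make the test statistic concrete, the sketch below computes a one-way ANOVA F statistic by hand: the variance between group means divided by the variance within groups. It is written in Python rather than the R used for the report and stops at the F statistic, since the standard library has no F distribution for the p-value. The dti values are the handful of rows from the Head and Tail tables for three purpose categories, so the result illustrates the mechanics only and is not a result from the report; `one_way_anova_f` is a hypothetical helper name.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within, ss_between, ss_within) for a dict of groups."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    group_means = {name: sum(vals) / len(vals) for name, vals in groups.items()}

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(vals) * (group_means[name] - grand_mean) ** 2
        for name, vals in groups.items()
    )
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (v - group_means[name]) ** 2
        for name, vals in groups.items()
        for v in vals
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within


# dti by purpose, using only rows visible in the Head and Tail tables (illustration only).
dti_by_purpose = {
    "debt_consolidation": [19.48, 11.63, 8.10, 13.09, 16.28],
    "credit_card": [14.29, 14.97],
    "all_other": [10.39, 0.21],
}

f_stat, df_between, df_within, ss_between, ss_within = one_way_anova_f(dti_by_purpose)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F relative to the F distribution with those degrees of freedom is what leads to rejecting the hypothesis of equal group means.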

3.3 Chi-Squared Tests

Chi-squared tests indicate that three pairs of variables are associated. The p-values for purpose and credit.policy, purpose and not.fully.paid, and credit.policy and not.fully.paid are all less than our chosen significance level of α = 0.05, so we reject the null hypothesis of independence for each pair. This indicates statistically significant relationships between these variables.

Chi-square test for purpose vs credit.policy:

Chi-square test for purpose vs not.fully.paid:

Chi-square test for credit.policy vs not.fully.paid:
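
The credit.policy vs not.fully.paid test can be reproduced approximately by hand, as sketched below in Python rather than the R used for the report. Two assumptions apply. First, the 2x2 cell counts are inferred from the reported totals and rounded percentages, not quoted from the report. Second, this is the plain Pearson chi-squared statistic without Yates' continuity correction, which R's chisq.test applies to 2x2 tables by default, so the number may differ slightly from the report's output. With one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)).

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value (1 df) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: credit.policy = 0, 1. Columns: fully paid, not fully paid.
# Assumption: cell counts inferred from the reported marginals and percentages.
table = [[1349, 519],
         [6696, 1014]]

chi2, p_value = chi_squared_2x2(table)
print(f"chi-squared = {chi2:.1f}, p-value = {p_value:.3g}")
print("reject independence at alpha = 0.05:", p_value < 0.05)
```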

4 Conclusion

From our analysis of the dataset we find that:

  1. The credit underwriting criteria of LendingClub appear to be effective: borrowers who do not meet the criteria are more than twice as likely to default as borrowers who do meet them.
  2. There are clear relationships between credit.policy and numeric variables such as int.rate, fico, and inq.last.6mths.
  3. So far there are no clearly established relationships between not.fully.paid and the numeric variables; its only clear relationship is with credit.policy.
  4. There are statistically significant relationships between the categorical and logical variables purpose, credit.policy, and not.fully.paid.
  5. For all numeric variables except delinq.2yrs, the mean varies significantly across the categories of purpose.

4.1 Vulnerabilities to Our Analysis

Private individuals historically made up the bulk of lenders in P2P markets. However, high interest rates and the prospect of risky borrowers undermined P2P lending as a legitimate financial industry. Combined with the push for more growth by intermediaries like LendingClub, these concerns began to prompt higher lending standards and discussions about more regulation. By 2017, larger institutions and banks began to overtake private individuals as the primary source of lending in P2P markets. We assume this shift in P2P lenders altered the makeup of who receives loans and on what terms.

Furthermore, this dataset covers years of very different economic environments. For example, it contains data points from prior to, during, and after the 2008 financial crisis. Since we do not know the exact year of the individual loans we cannot take the time period into consideration, nor can we do time-series analysis to see how it affects the variables.

5 Follow-Up: Revolving Balance

While the annual income data was given to us as a natural log, the revolving balance was given to us unmodified. We found that taking the log of the revol.bal variable gives a more normal distribution, making the variable more usable. However, 321 loans have a revol.bal value of 0, which returns -Inf when the natural log is taken. We will need to revisit this in the future. For now, we want to demonstrate the results of taking the log of revol.bal and how it increases the readability of the data.
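
One common way to handle zero balances, offered here only as a possible option for the future revisit and not as the approach taken in this report, is to use log(1 + x) instead of log(x). The sketch below is in Python rather than the R used for the report; note that Python's math.log(0) raises a ValueError, whereas R returns -Inf. The balances are a few revol.bal values from the Head and Tail tables, including the one zero, and `log1p_transform` is a hypothetical helper name.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x), which maps 0 to 0 and stays finite for non-negative input."""
    return [math.log1p(v) for v in values]


# A few revol.bal values from the Head and Tail tables, including a zero balance.
revol_bal = [28854, 33623, 3511, 0, 184]

transformed = log1p_transform(revol_bal)
for raw, logged in zip(revol_bal, transformed):
    print(f"{raw:>6}  ->  {logged:.4f}")
```

For large balances log(1 + x) is nearly identical to log(x), so the shape of the distribution away from zero is essentially unchanged.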


  1. “Peer to Peer (P2P) Lending Market Size, Report 2022-2030.” Precedence Research, www.precedenceresearch.com/peer-to-peer-lending-market. Accessed 4 Nov. 2022.

  2. Peng, R. D., & Matsui, E. (2016). The Art of Data Science: A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC.

  3. “Delinquency Rate on All Loans, All Commercial Banks (DRALACBN).” FRED, St. Louis Fed, 22 Aug. 2022, fred.stlouisfed.org/series/DRALACBN.

  4. End to end case study (classification): Lending Club Data. (n.d.). Retrieved November 3, 2022, from https://towardsdatascience.com/end-to-end-case-study-classification-lending-club-data-489f8a1b100a

  5. Yiu, T. (2019, June 19). Turning Lending Club’s Worst Loans into Investment Gold. Medium. https://towardsdatascience.com/turning-lending-clubs-worst-loans-into-investment-gold-475ec97f58ee

  6. Lending Club Review: How it Works, Requirements and Alternatives. (n.d.). Debt.org. https://www.debt.org/credit/loans/personal/lending-club-review/

  7. Project 1: Analysis of Lending Club’s data. (n.d.). Data Science Blog. Retrieved November 3, 2022, from https://nycdatascience.com/blog/student-works/project-1-analysis-of-lending-clubs-data/

  8. Kana, M. (2019, April 9). LendingClub: bias in data? Machine learning and investment strategy. Medium. Retrieved November 3, 2022, from https://michel-kana.medium.com/lendingclub-bias-in-data-machine-learning-and-investment-strategy-3a3bd1c65f0